Building upon the contribution of Brunori, Hufe, and Mahler (2018), the goal of our project is to estimate inequality of opportunity (IOP) based on Machine Learning (ML) techniques. The measurement of unequal opportunities goes back to Roemer (1998), who states that the factors determining individual outcomes (e.g. income) fall into two categories: effort and circumstances. Effort summarizes all aspects over which individuals have control, while circumstances include all factors individuals cannot control. Individuals identified by the same exogenous circumstances, and therefore characterized by similar background conditions, belong to a circumstance type (Roemer 1998). In what follows we analyze between-type disparities.
In the previous literature several methods have been applied to measure IOP. These include parametric, non-parametric and latent variable approaches (Roemer and Trannoy 2015). The reason we rely upon ML methods is that, as outlined by Brunori, Hufe, and Mahler (2018), the empirical results are sensitive to the model selection made by the researcher. As discussed at length in the literature, model selection, and thus the partial observability of the true set of exogenous variables, causes a downward bias in opportunity estimates (Ferreira and Gignoux 2011). With ML methods it is possible to circumvent problems like sample size limitations or non-fixed and non-additive effects of circumstances and thus balance the bias (Bourguignon, Ferreira, and Menéndez 2007; Ferreira and Gignoux 2011). More precisely, in our empirical analysis we make use of conditional inference trees and conditional inference forests, which belong to the classification and regression tree (CART) methods popularized by Breiman et al. (1984). CART methods lower the risk of arbitrary model selection by using an automated algorithm that splits the predictor space into non-overlapping areas to establish the best model for predicting the outcome variable. In the case of equality of opportunity the algorithm partitions the sample of respondents into different types. Moreover, the conditional inference algorithm employs sequences of hypothesis tests, which restrains model overfitting and thus trades off upward and downward biases (Hothorn, Hornik, and Zeileis 2006). Another advantage of CART methods is their intuitive graphical representation, which makes them easily accessible to a large audience.
The project is organized as follows: First, after properly assessing our data, we use the Austrian income and living conditions data for 2019 to estimate conditional inference trees and a conditional inference forest for Austria. Second, we apply the same data wrangling and data analysis procedure to the synthetic EU-SILC 2011 data set, where we estimate the trees and forests for six countries. In the last part, we compare the results across countries and conclude.
Libraries
start <- Sys.time() # measure time
library(bibtex) # citations
library(knitcitations); cleanbib()
cite_options(citation_format = "pandoc", check.entries=FALSE)
library(tidyverse)
library(readr) # import
library(rpart) # regression trees
library(rpart.plot) # regression tree plots
library(summarytools) # summary statistics
library(party) # ctree
library(partykit) # ctree
library(caret)
library(forecast)
library(ineq) # Gini
library(precrec) # ROC curves
library(corrplot) # Correlation plots
library(plotly) # interactive ggplot2 plots :D
library(DescTools) # Winsorization
In the first part of our data analysis we rely upon data from Statistics Austria, namely the survey data on income and living conditions in Austria for the year 2019. In addition, the data set includes an ad-hoc module with many of the intergenerational transmission variables needed to properly assess the inequality of opportunity of the respondents.
# setting the data path
data_path ="./AT2019"
# accessing the data
data19 <- read.csv(file.path(data_path, "p_silc2019_ext.csv"), sep = ";")
data19_pID <- read.csv(file.path(data_path, "id_schluessel_r_ext.csv"), sep = ";") # personal ID
data19_h <- read.csv(file.path(data_path, "h_silc2019_ext.csv"), sep = ";") # household data
data19_hID <- read.csv(file.path(data_path, "id_schluessel_d_ext.csv"), sep = ";") # household ID
data19_h <- data19_h %>% select(hy020, Hid)
data19 <- data19 %>% left_join(data19_h, by = "Hid")
The first step of our data wrangling is to select the variables of interest, based upon the list of circumstances chosen in Brunori, Hufe, and Mahler (2018). The circumstances include both characteristics describing the respondent and circumstances related to the intergenerational transmission of the respondent. Personal characteristics are sex and country of birth. Intergenerational circumstances include the presence of parents at home, the number of adults present at home (aged 18 and over), the number of working adults present, and the number of children (under 18) present at home, all when the respondent was aged 14. Further intergenerational circumstances are the level of education of the respondent's parents, their occupational status, main occupation and whether they held a managerial position, their citizenship, and the tenancy status.
As our outcome variable we use total disposable household income. Unfortunately, the data provided by Statistik Austria does not contain the original answers; instead, the variables are aggregated following Eurostat standards. The applied income variable is net income, i.e. after deduction of taxes and social insurance. In the Austria 2019 data set most negative incomes are replaced by 0 entries. However, the provided data is meant to be comparable with the data officially published by Statistik Austria.
After selecting the variables described above from the original data set, we rename all the variables and save them to our main data set data19. Building on this data set, we further limit our data by age: we only include respondents aged between 30 and 59, since this captures the working age population. In the next step we drop all answers where the respondents refused or were not able to provide information concerning intergenerational circumstances such as the father's or mother's citizenship. We do not do this for all variables, since it would leave us with too few observations (e.g. dropping adults). Next, we need to recode some of the variables from character into factor or numeric variables in order to later estimate the conditional inference trees.
Import and rename variables:
data19 <- data19 %>% select(sex, hy020, P038004, P110000nu, P111010nu, alter, M009010, M010000, M014000, M016000, M017000, M020010, M021000, M025000, M027000, M028000, M004000, M001300, M001510, M003100, M001100, M001200, M001500) %>%
rename("net_income" = hy020, # total disposable household income
"inc_net" = P038004, # gross monthly income
"country_birth" = P110000nu, # country of birth of respondent
"citizenship" = P111010nu, # citizenship of respondent
"age" = alter, # age of respondent
"father_cit" = M009010, # citizenship of father at age 14
"father_edu" = M010000, # education of father at age 14 (highest level completed)
"father_occup_stat" = M014000, # occupational status of father at age 14
"father_occup" = M016000, # main occupation of father at age 14
"father_manag" = M017000, # managerial position of father at age 14
"mother_cit" = M020010, # citizenship of mother at age 14
"mother_edu" = M021000, # education of mother at age 14
"mother_occup_stat" = M025000, # occupational status of mother at age 14
"mother_occup" = M027000, # main occupation of mother at age 14
"mother_manag" = M028000, # managerial position of mother at age 14
"tenancy" = M004000, # tenancy at age 14
"children" = M001300, # number of children (under 18) in respondent’s household at age 14
"adults" = M001510, # number of adults (aged 18 or more) in respondent’s household
"adults_working" = M003100, # number of working adults (aged 18 or more) in respondent’s hhd.
"father_present" = M001100, # father present in respondent’s household at age 14
"mother_present" = M001200, # mother present in respondent’s household at age 14
"adults_present" = M001500, # adults present in respondent’s household at age 14
)
Filter the working-age group as defined in Brunori, Hufe, and Mahler (2018) and recode father_cit & mother_cit as well as father_manag & mother_manag:
data19 <- data19 %>%
filter(age %in% (30:60), mother_present >= 0, father_present >= 0) %>%
mutate("both_parents_present" = father_present + mother_present,
# 4 = none present, 3 = one present, 2 = both present
father_cit = ifelse(father_cit == 1, 1, 2),
# Austria = 1 & Other = 2
mother_cit = ifelse(mother_cit == 1, 1, 2),
# Austria = 1 & Other = 2
father_manag = ifelse(father_manag == 1, 1, ifelse(father_manag == 2, 0, NA)),
# yes = 1 & no = 0
mother_manag = ifelse(mother_manag == 1, 1, ifelse(mother_manag == 2, 0, NA)),
# yes = 1 & no = 0
father_edu = ifelse(father_edu == -5, NA, ifelse(father_edu ==-2, NA, father_edu)),
# -2 & -5 both recoded as NA
mother_edu = ifelse(mother_edu == -5, NA, ifelse(mother_edu ==-2, NA, mother_edu)),
# -2 & -5 both recoded as NA
father_occup = ifelse(father_occup == -5, NA, ifelse(father_occup == -2, NA, father_occup)),
# -5 & -2 recoded as NA
mother_occup = ifelse(mother_occup == -5, NA, ifelse(mother_occup == -2, NA, mother_occup)),
)
Factorize categorical variable:
factor <- c("sex",
"country_birth",
"citizenship",
"father_cit",
"father_edu",
"father_occup_stat",
"father_occup",
"father_manag",
"mother_cit",
"mother_edu",
"mother_occup_stat",
"mother_occup",
"mother_manag",
"tenancy",
"both_parents_present",
"father_present",
"mother_present")
data19[factor] <- lapply(data19[factor], as.factor)
sapply(data19, class)
## sex net_income inc_net
## "factor" "integer" "integer"
## country_birth citizenship age
## "factor" "factor" "integer"
## father_cit father_edu father_occup_stat
## "factor" "factor" "factor"
## father_occup father_manag mother_cit
## "factor" "factor" "factor"
## mother_edu mother_occup_stat mother_occup
## "factor" "factor" "factor"
## mother_manag tenancy children
## "factor" "factor" "integer"
## adults adults_working father_present
## "integer" "integer" "factor"
## mother_present adults_present both_parents_present
## "factor" "integer" "factor"
Recode factor variables
data19cor <- data19 # for correlation plot later
levels(data19$sex) <- c("Male", "Female")
levels(data19$country_birth) <- c("Unknown", "Austria", "EU15", "EU12", "Yugo", "Turkey", "Other")
levels(data19$citizenship) <- c("Austria", "EU15", "EU12", "Yugo", "Turkey", "Other")
levels(data19$father_cit) <- c("Austria", "Other")
levels(data19$mother_cit) <- c("Austria", "Other")
levels(data19$both_parents_present) <- c("both", "one", "none")
levels(data19$father_manag) <- c("No", "Yes") # levels are sorted 0, 1, so 0 = No and 1 = Yes
levels(data19$mother_manag) <- c("No", "Yes") # levels are sorted 0, 1, so 0 = No and 1 = Yes
levels(data19$father_occup) <- c("Unable","Army", "Manager", "Professional", "Technician", "Clerical", "Service","Agri", "Craft", "Operator", "Elementary")
levels(data19$mother_occup) <- c("Unable", "Army", "Manager", "Professional", "Technician", "Clerical", "Service","Agri", "Craft", "Operator", "Elementary")
Education, occupation and occupational status are not recoded. Occupations, for example, are coded according to the major groups in ISCO-08:

- 0: Armed forces occupations
- 1: Managers
- 2: Professionals
- 3: Technicians and associate professionals
- 4: Clerical support workers
- 5: Service and sales workers
- 6: Skilled agricultural, forestry and fishery workers
- 7: Craft and related trades workers
- 8: Plant and machine operators, and assemblers
- 9: Elementary occupations
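As an illustration of this coding scheme, a named lookup vector can translate ISCO-08 major group digits into the short labels used above. This is a minimal base-R sketch of our own; the vector below is not part of the original pipeline:

```r
# Hypothetical lookup: ISCO-08 major group digit -> short label used in this document
isco_labels <- c("0" = "Army", "1" = "Manager", "2" = "Professional",
                 "3" = "Technician", "4" = "Clerical", "5" = "Service",
                 "6" = "Agri", "7" = "Craft", "8" = "Operator",
                 "9" = "Elementary")

# Example: recode a vector of raw major-group codes
raw_codes <- c(1, 5, 9)
isco_labels[as.character(raw_codes)]
# -> "Manager" "Service" "Elementary"
```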
print(dfSummary(data19), method="render", style="grid", plain.ascii = F)
| No | Variable | Stats / Values | Valid | Missing |
|----|----------|----------------|-------|---------|
| 1 | sex [factor] | Male, Female | 4774 (100.0%) | 0 (0.0%) |
| 2 | net_income [integer] | mean (sd): 54605.2 (33528.4); min < med < max: 0 < 50314 < 873013; IQR (CV): 34377.5 (0.6); 3257 distinct values | 4774 (100.0%) | 0 (0.0%) |
| 3 | inc_net [integer] | mean (sd): 1413.8 (1229.9); min < med < max: -3 < 1480 < 13000; IQR (CV): 2202 (0.9); 608 distinct values | 4774 (100.0%) | 0 (0.0%) |
| 4 | country_birth [factor] | Unknown, Austria, EU15, EU12, Yugo, Turkey, Other | 4774 (100.0%) | 0 (0.0%) |
| 5 | citizenship [factor] | Austria, EU15, EU12, Yugo, Turkey, Other | 4774 (100.0%) | 0 (0.0%) |
| 6 | age [integer] | mean (sd): 45.1 (8.4); min < med < max: 30 < 46 < 58; IQR (CV): 14 (0.2); 29 distinct values | 4774 (100.0%) | 0 (0.0%) |
| 7 | father_cit [factor] | Austria, Other | 4774 (100.0%) | 0 (0.0%) |
| 8 | father_edu [factor] | levels 0–9 | 4590 (96.1%) | 184 (3.9%) |
| 9 | father_occup_stat [factor] | levels -5, -2, 1, 2, 3, 4 | 4774 (100.0%) | 0 (0.0%) |
| 10 | father_occup [factor] | Unable, Army, Manager, Professional, Technician, Clerical, Service, Agri, Craft, Operator, Elementary | 4503 (94.3%) | 271 (5.7%) |
| 11 | father_manag [factor] | Yes, No | 4372 (91.6%) | 402 (8.4%) |
| 12 | mother_cit [factor] | Austria, Other | 4774 (100.0%) | 0 (0.0%) |
| 13 | mother_edu [factor] | levels 0–9 | 4714 (98.7%) | 60 (1.3%) |
| 14 | mother_occup_stat [factor] | levels -5, -2, 1, 2, 3, 4 | 4774 (100.0%) | 0 (0.0%) |
| 15 | mother_occup [factor] | Unable, Army, Manager, Professional, Technician, Clerical, Service, Agri, Craft, Operator, Elementary | 4690 (98.2%) | 84 (1.8%) |
| 16 | mother_manag [factor] | Yes, No | 2988 (62.6%) | 1786 (37.4%) |
| 17 | tenancy [factor] | levels -3, -2, 1, 2, 3 | 4774 (100.0%) | 0 (0.0%) |
| 18 | children [integer] | mean (sd): 1.1 (0.5); min < med < max: -3 < 1 < 2; IQR (CV): 0 (0.5) | 4774 (100.0%) | 0 (0.0%) |
| 19 | adults [integer] | mean (sd): -2.1 (1.9); min < med < max: -3 < -3 < 15; IQR (CV): 0 (-0.9) | 4774 (100.0%) | 0 (0.0%) |
| 20 | adults_working [integer] | mean (sd): 1.8 (1.2); min < med < max: -3 < 2 < 14; IQR (CV): 1 (0.7); 15 distinct values | 4774 (100.0%) | 0 (0.0%) |
| 21 | father_present [factor] | 1, 2 | 4774 (100.0%) | 0 (0.0%) |
| 22 | mother_present [factor] | 1, 2 | 4774 (100.0%) | 0 (0.0%) |
| 23 | adults_present [integer] | mean (sd): 1.6 (0.9); min < med < max: -3 < 2 < 2; IQR (CV): 0 (0.5) | 4774 (100.0%) | 0 (0.0%) |
| 24 | both_parents_present [factor] | both, one, none | 4774 (100.0%) | 0 (0.0%) |
Generated by summarytools 0.9.8 (R version 4.0.3)
2021-02-23
In order to get a first glimpse of how high or low income inequality in Austria is in general, we calculate and visualize the Gini coefficient.
ineq(data19$net_income, type = "Gini")
## [1] 0.2940572
The Gini index is 0.29, which is slightly lower than the World Bank estimate for Austria of 0.3 (2017), available at https://data.worldbank.org/indicator/SI.POV.GINI?locations=AT.
plot(Lc(data19$net_income), col = "darkred", lwd = 3)
The Gini index corresponds to twice the area between the black line of equal distribution and the red Lorenz curve of the actual distribution.
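To make the connection between the Lorenz curve and the index concrete, the Gini coefficient can also be computed directly from its mean-absolute-difference definition, \(G = \frac{\sum_i \sum_j |y_i - y_j|}{2 n^2 \bar{y}}\). A minimal base-R sketch of our own, independent of the ineq package:

```r
# Gini coefficient from the mean-absolute-difference definition:
# G = sum_{i,j} |y_i - y_j| / (2 * n^2 * mean(y))
gini_mad <- function(y) {
  n <- length(y)
  sum(abs(outer(y, y, "-"))) / (2 * n^2 * mean(y))
}

gini_mad(c(1, 1, 1, 1))  # perfectly equal incomes -> 0
gini_mad(c(0, 1))        # one person holds everything -> 0.5
```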
Inequality statistics tend to be heavily influenced by outliers, so we first apply a Winsorization to our data (see Brunori and Neidhoefer 2020, p. 12). The Winsorization sets all non-positive incomes equal to 1 and caps all incomes exceeding the 99.5th percentile of the income distribution at that same threshold. Furthermore, we transform the income variable by taking the log. This means that ultimately our predicted income is the exponential of the predicted log income.
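The winsorization logic can be illustrated on a toy vector. This is a base-R sketch of our own with made-up numbers; the actual analysis uses DescTools::Winsorize below:

```r
# Toy incomes: a negative value, a zero, and an extreme outlier (made-up numbers)
inc <- c(-100, 0, 20000, 40000, 60000, 1e7)

cap    <- quantile(inc, probs = 0.995)  # upper winsorization threshold
inc_w  <- pmin(pmax(inc, 1), cap)       # floor non-positive incomes at 1, cap at threshold
inc_ln <- log(inc_w)                    # log-transform used for the tree models
```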
quantile <- quantile(data19$net_income, weights = NULL, probs = seq(0, 1, 0.005), na.rm = FALSE, names = TRUE, type = 7) %>% tail(2)
data19 <- data19 %>% mutate(income = Winsorize(data19$net_income, minval = 1, maxval = quantile[1], probs = c(0.05, 0.95),
na.rm = FALSE, type = 7), inc_log = log(income))
agepyra <- ggplot(data19, aes(x = age, fill= sex)) +
geom_bar(data = subset(data19, sex=="Female")) +
geom_bar(data = subset(data19, sex=="Male"), aes(y=..count..*(-1))) +
scale_x_continuous(breaks = seq(30,60,2), labels = abs(seq(30,60,2))) +
scale_fill_manual(name = "Sex", labels = c("Male", "Female"), values=c("springgreen2", "slateblue1")) +
labs(title = "Age pyramid of ad-hoc module on intergenerational transmission of disadvantages", x = "Age", y = "Number of people") +
theme_bw() +
coord_flip()
ggplotly(agepyra)
### Correlation plot
# Correlation plot across all variables, coerced to numeric.
# data19cor was stored above, before the factor levels were relabelled.
# We drop the nominal variables father_occup & mother_occup, whose
# arbitrary category codes have no meaningful ordering.
data19cor <- select(data19cor, -c(father_occup, mother_occup))
data19cor <- data.frame(sapply(data19cor, as.numeric))
# Computing correlation coefficients and significance thereof
# (pairwise complete observations, as some parental variables contain NAs)
cor19 <- cor(data19cor, use = "pairwise.complete.obs")
res1 <- cor.mtest(data19cor, conf.level = 0.99)
corrplot(cor19, method = "ellipse", type = "upper", order = "FPC", diag = FALSE, outline = FALSE, tl.cex = .5, tl.col = "black", title = "Correlation plot", p.mat = res1$p, sig.level = 0.01, insig = "blank", mar = c(2, 2, 2, 2))
As can be seen from the correlation plot, all variables are significantly related to at least one other variable of the data set (at the 1% significance level). For better visibility insignificant correlations are blanked out. As the correlation matrix is ordered using the first principal component there is some clustering of significant correlations.
To estimate equality of opportunity we let an automated algorithm decide the partition of the population into mutually exclusive types, in order to obtain a measure of inequality of opportunity. We follow the procedure described by Brunori, Hufe, and Mahler (2018). We show our results using both classification and regression trees and conditional inference trees, with more emphasis on the latter. Conditional inference trees and conditional inference forests are a technique developed and described by Hothorn, Hornik, and Zeileis (2006). We break down their main characteristics for our purposes:
The essential R functions we use are:

- ctree from the party package in R: recursive partitioning just like rpart
- rpart: maximizing an information measure
- ctree: significance test procedure
- caret: for additional cross validation on top of ctree_control
Advantages of Trees: Next to being rather straightforward to interpret, using such an algorithm minimizes the degree of randomness and arbitrariness in model selection. Trees show outcome variability without initially assuming which circumstances play a significant role in shaping individual opportunities or how they interact (Brunori, Hufe, and Mahler 2018).
Advantages of Trees over linear regression models: a very large set of observations can be used, and the model specification is no longer exogenously given.
Advantages of Conditional Inference Trees over Regression and Classification Trees (CART): the algorithm automatically provides a test for the null hypothesis of equality of opportunity, prevents overfitting while CART “cannot distinguish between a significant and an insignificant improvement in the information measure” (Mingers 1987, as cited in Hothorn, Hornik, and Zeileis (2006), 2), and considers the distributional properties of the measures. Since the algorithm avoids upward and downward biases, the estimates obtained are better suited for comparisons across time (i.e. Austria 2011 to Austria 2019) and across countries (EU-SILC), even when sample sizes differ (Brunori, Hufe, and Mahler 2018).
Procedure
Empirical approach as described in Brunori, Hufe, and Mahler (2018, 4): We consider for each country a population of size \(N\), indexed by \(i \in \{1, ..., N \}\), and a vector of incomes \(Y=\{y_1,...,y_i,...,y_N \}\). Our assumption is that each individual \(i\)'s income is the result of two sets of factors: a set of circumstances, which are beyond her control and which we observe, of size \(P\): \(\Omega_i =\{ C^1_i, ..., C^p_i, ..., C^P_i\}\); and a set of efforts, which we do not observe, of size \(Q\): \(\Theta_i = \{E^1_i, ..., E^q_i, ..., E^Q_i \}\). This results in a very general outcome generating function \(g: \Omega \times \Theta \rightarrow \mathbb{R}_+\), or \[y_i = g(\Omega_i, \Theta_i).\] Each circumstance \(C^p \in \Omega\) has a total of \(X^p\) realizations, each denoted \(x^p\). The conditional inference trees partition the population into a set of non-overlapping types, whereby a type is a subgroup of the original population that is homogeneous in terms of circumstances. We have types \(T=\{t_1, ..., t_m,...,t_M \}\), and individuals \(i\) and \(j\) belong to the same type \(t_m \in T\) if \(x^p_i = x^p_j ~ \forall C^p \in \Omega\). Likewise, they belong to different types if \(\exists C^p \in \Omega : x^p_i \ne x^p_j\). Types define a particular way of partitioning the population into subgroups, and group membership indicates uniformity in circumstances. In essence this means that the approach we utilize here takes an ex-ante view, which focuses on between-type differences in the value of opportunity sets without paying attention to the effort realizations of individuals. The tree-based method obtains the predictions for our outcome \(Y\) as a function of the input variables \(I\), our observed circumstances. The method uses this set of variables to partition the population into a set of non-overlapping groups, \(G= \{g_1,...,g_m,...,g_M \}\), where each group is homogeneous in each input variable.
Graphically these groups are identified as terminal nodes or leaves. The tree also gives us the predicted outcome value per observation. This means that in addition to the observed income vector \(Y=\{y_1,...,y_i,...,y_N \}\) we also obtain a vector of predicted values \(\hat{Y}=\{\hat{y}_1,...,\hat{y}_i,...,\hat{y}_N \}\), where \[\hat{y}_i = \mu_m = \frac {1}{N_m} \sum_{i \in g_m} y_i, \quad \forall i \in g_m, \forall g_m \in G.\] The interpretation of the regression trees is then that, conditional on the input variables being circumstances only (\(I \subseteq \hat{\Omega} \subseteq \Omega\)), each resulting group \(g_m \in G\) can be interpreted as a circumstance type \(t_m \in T\). Importantly, the predicted vector \(\hat{Y}\) is analogous to the smoothed distribution of \(Y\) and is our prediction of equal incomes within a group.
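The prediction rule just described, where every member of a group receives that group's mean outcome, can be sketched in a few lines of base R (a toy illustration with made-up incomes, not the actual estimation):

```r
# Toy data: four incomes and a two-group type assignment
y <- c(1, 3, 10, 20)
g <- c("a", "a", "b", "b")

# Each individual's prediction is the mean income of their type:
# this is the "smoothed" income distribution Y-hat
y_hat <- ave(y, g, FUN = mean)
y_hat
# -> 2 2 15 15
```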
The algorithm of conditional inference trees follows a step-wise procedure of permutation tests, as described by Brunori, Hufe, and Mahler (2018, 7–8):
- Choose a confidence level \(\alpha\). Test the null hypothesis of independence, \(H_0^{C^p} : D(Y|C^p) = D(Y)\), for each input variable \(C^p \in \hat{\Omega}\), and obtain a p-value associated with each test, \(p^{C^p}\). \(\implies\) We adjust the p-values for multiple hypothesis testing, such that \(p_{adj.}^{C^p} = 1-(1-p^{C^p})^P\), a Bonferroni-type correction.
- Choose feature: test the null hypotheses of independence between the individual outcome and each of the observable circumstances (variables). The model selects the variable \(C^*\) with the lowest adjusted p-value, i.e. \(C^* = \{C^p : \text{argmin} ~ p_{adj.}^{C^p} \}\).
- If no null hypothesis can be rejected, stop: \(\implies\) if \(p_{adj.}^{C^p} > \alpha\) for all circumstances, exit the algorithm.
- If one or more circumstances are significant, select the circumstance with the smallest p-value and proceed: \(\rightarrow\) if \(p_{adj.}^{C^p} \leq \alpha\), continue and select \(C^*\) as the splitting variable.
- Choose split: for every possible way the selected circumstance can divide the sample into two subgroups, test the hypothesis of equal mean outcomes in the two resulting subgroups. Technically, for each possible binary partition \(s\) based on \(C^*\), with \(Y_s = \{Y_i : C^*_i < x^p \}\) and \(Y_{-s} = \{Y_i : C^*_i \geq x^p \}\), we test the discrepancy between the subsamples and obtain a p-value associated with each test, \(p^{C^*_s}\). \(\implies\) We then split the sample at the point \(s\) that yields the lowest p-value, i.e. \(C^*_s = \{C^*_s : \text{argmin} ~ p^{C^*_s} \}\).
- Repeat the above steps on each resulting subsample. :)
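One round of the variable-selection step can be sketched in base R: test each circumstance against the outcome, apply the \(p_{adj.} = 1-(1-p)^P\) correction, and pick the circumstance with the smallest adjusted p-value. This is a toy illustration with simulated data; the real algorithm inside ctree uses permutation-test statistics, not the two-sample t-tests used here:

```r
set.seed(1)
n <- 500
# Simulated circumstances: x1 shifts the outcome, x2 and x3 are pure noise
x1 <- factor(sample(c("A", "B"), n, replace = TRUE))
x2 <- factor(sample(c("A", "B"), n, replace = TRUE))
x3 <- factor(sample(c("A", "B"), n, replace = TRUE))
y  <- rnorm(n) + ifelse(x1 == "A", 1, 0)  # strong effect of x1 only

circumstances <- list(x1 = x1, x2 = x2, x3 = x3)
P <- length(circumstances)

# Raw p-values of the independence tests (here: two-sample t-tests)
p_raw <- sapply(circumstances, function(x) t.test(y ~ x)$p.value)
# Correction used in the paper: p_adj = 1 - (1 - p)^P
p_adj <- 1 - (1 - p_raw)^P

alpha <- 0.05
if (min(p_adj) <= alpha) {
  best <- names(which.min(p_adj))   # splitting variable C*
} else {
  best <- NA                        # exit: equality of opportunity not rejected
}
best
# -> "x1"
```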
In the context of estimating inequality of opportunity, conditional inference trees offer a particular structure: each hypothesis test is a test for whether equality of opportunity exists within a group. If the tree results in no splits, we cannot reject the null hypothesis of equality of opportunity, while the deeper the tree is grown, the more types are required to account for inequality of opportunity in the country under consideration. Each split (parent node) thus indicates that the opportunities of the two resulting groups are significantly different, while we cannot say the same for the groups included in the leaf nodes.
To showcase the difference between the regression and classification trees we discussed in class and the conditional inference trees, we also plot the former as a comparison. In the following chunk of code we use set.seed to make the random draws reproducible. We define our formula, which consists of all the circumstances we use for estimation. Furthermore, we split the data into a training and a test sample. Finally, we define fitControl, our tuning function for cross validation using the caret package.
set.seed(12345)
formula = inc_log ~ sex + country_birth + father_cit + father_edu + father_occup_stat + father_occup + father_manag + mother_cit + mother_edu + mother_occup_stat + mother_occup + mother_manag + tenancy + children + adults_working + both_parents_present
data19 <- data19 %>%
mutate(train_index = sample(c("train", "test"), nrow(data19), replace=TRUE, prob=c(0.67, 0.33)))
train <- data19 %>% filter(train_index=="train")
test <- data19 %>% filter(train_index=="test")
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, savePredictions = T)
tuning_grid <- expand.grid(cp = seq(0, 0.02, by= 0.005))
tuning_grid
## cp
## 1 0.000
## 2 0.005
## 3 0.010
## 4 0.015
## 5 0.020
caret_rpart <- train(formula, data = train, method = "rpart", trControl = fitControl, tuneGrid = tuning_grid, metric = "RMSE", na.action = na.pass)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
caret_rpart
## CART
##
## 3165 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 2849, 2849, 2849, 2848, 2848, 2849, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.000 0.7316886 0.009435384 0.5040564
## 0.005 0.6836426 0.021320797 0.4577039
## 0.010 0.6791330 0.024618770 0.4557362
## 0.015 0.6763654 0.025883373 0.4548592
## 0.020 0.6800216 0.016833513 0.4572333
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.015.
tree_caret_final <- caret_rpart$finalModel
rpart.plot(tree_caret_final, box.palette="RdBu", nn=FALSE, type=2)
The caret_rpart tree for Austria shows a tree with one partition and two terminal nodes. The splitting variable already indicates which circumstance is most significant in determining the income of the respondents: in Austria, the citizenship of the father appears to be the most important determinant of income. However, we only showcase this tree as a comparison to the conditional inference trees.
The conditional inference tree algorithm as included in the party and partykit packages contains various adjustment points for variable selection and stopping criteria. We use the default ctree_control function, but think it necessary to explain what it does exactly and why we believe the specifications we have chosen are not distorting. We use the default test statistic since we know neither the conditional expectation nor the covariance of our circumstances; in such a case the default setting ctree_control(teststat = "quad") is recommended (Hothorn, Hornik, and Zeileis 2006). The Austria 2019 data set has been mostly cleaned of missing values, but we still have many NA entries further on in the document. For reasons of uniformity, we chose the test type ctree_control(testtype = "Univariate"), which uses unadjusted p-values. However, we also use the caret package for cross validation in addition to the default control function, since it is the one we discussed in class, and it keeps the results comparable.
# As a first step we grow an unpruned tree
Ctree <- ctree(formula, data = train, control = ctree_control(testtype = "Univariate"))
Ctree
##
## Model formula:
## inc_log ~ sex + country_birth + father_cit + father_edu + father_occup_stat +
## father_occup + father_manag + mother_cit + mother_edu + mother_occup_stat +
## mother_occup + mother_manag + tenancy + children + adults_working +
## both_parents_present
##
## Fitted party:
## [1] root
## | [2] father_cit in Austria
## | | [3] father_occup_stat in -5, 1, 2, 4
## | | | [4] mother_edu in 0, 1, 2, 4, 7: 10.774 (n = 2016, err = 892.5)
## | | | [5] mother_edu in 3, 5, 6, 8, 9
## | | | | [6] both_parents_present in both: 10.920 (n = 412, err = 125.0)
## | | | | [7] both_parents_present in one, none: 10.725 (n = 66, err = 21.8)
## | | [8] father_occup_stat in -2, 3
## | | | [9] mother_occup_stat in -5, -2, 2: 8.732 (n = 7, err = 91.5)
## | | | [10] mother_occup_stat in 1, 3, 4: 10.407 (n = 26, err = 3.8)
## | [11] father_cit in Other
## | | [12] tenancy in -3, -2, 1: 10.419 (n = 230, err = 149.9)
## | | [13] tenancy in 2, 3
## | | | [14] country_birth in Unknown, EU12, Yugo, Other
## | | | | [15] both_parents_present in both, one: 10.546 (n = 256, err = 101.0)
## | | | | [16] both_parents_present in none: 10.019 (n = 10, err = 5.7)
## | | | [17] country_birth in Austria, EU15, Turkey
## | | | | [18] father_manag in Yes: 10.682 (n = 94, err = 29.4)
## | | | | [19] father_manag in No: 10.847 (n = 48, err = 20.9)
##
## Number of inner nodes: 9
## Number of terminal nodes: 10
mean((train$inc_log - predict(Ctree))^2) #MSE
## [1] 0.4554355
cor(predict(Ctree, newdata=test),test$inc_log)^2 #R-sq
## [1] 0.02605945
plot(Ctree, type = "simple",gp = gpar(fontsize = 6),
inner_panel=node_inner,
ip_args=list(id = FALSE), main = "Conditional Inference Tree for Austria 2019")
We obtain a deep tree with 9 inner nodes and 10 terminal nodes. In this estimate the citizenship of the father appears as the first determining split. The predictive power of the model does not appear to be good when we look at the errors indicated in the gray boxes: we obtain an MSE of 0.45 and an \(R^2\) of 0.026. As an example of how to read the tree, consider terminal node 4, where 2016 observations are grouped together. Going down the tree from the initial split, we can tell that this group has a father with Austrian citizenship who is employed, and a mother with varying levels of education. The predicted income for this group, which accounts for 64% of the whole training sample, is EUR 47,762.69. Going down the other split to node 6, we find a group accounting for 13% of the training sample, in which respondents with both parents present are grouped together. The predicted income for this group is EUR 55,270.80.
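The node predictions in the output above are on the log scale; the EUR figures quoted here are simply their exponentials:

```r
# Predictions from the tree are log incomes; exp() converts back to EUR.
exp(10.774)  # node 4
exp(10.920)  # node 6
```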
In the next step, we use cross-validation for our prediction via the caret package, again using the train/test split of the data. In addition, we apply the ctree method to see how well the model predicts the income of our test data.
caret_ctree <- train(formula, data = train, method = "ctree", trControl = fitControl, na.action = na.pass)
caret_ctree
## Conditional Inference Tree
##
## 3165 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 2849, 2849, 2848, 2848, 2849, 2849, ...
## Resampling results across tuning parameters:
##
## mincriterion RMSE Rsquared MAE
## 0.01 0.6923217 0.01201291 0.4666870
## 0.50 0.6804699 0.02045711 0.4577135
## 0.99 0.6761019 0.02675760 0.4543686
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mincriterion = 0.99.
plot(caret_ctree$finalModel, type = "simple")
#That's just the mean of the overall predictions, not how well the model actually fits
P_caret_ctree <- mean(predict(caret_ctree, newdata = test))
exp(P_caret_ctree) #Which is the predicted average income - and it's quite off the actual mean income.
## [1] 46448.14
Above, we showcase the cross-validated tree obtained from the caret package. Here there is only one split: the population is divided into two groups, for which the predicted income depends only on the citizenship of the father. The predicted incomes are EUR 47,810.47 for node 2 and EUR 37,421.47 for node 3.
In the next step, we again use the ctree_control function instead of cross-validation, but with the suggested mincriterion obtained from cross-validation.
caret_ctree_U <- ctree(formula, data = data19, control = ctree_control(testtype = "Univariate", alpha = 0.05, mincriterion = 0.99))
caret_ctree_U
##
## Model formula:
## inc_log ~ sex + country_birth + father_cit + father_edu + father_occup_stat +
## father_occup + father_manag + mother_cit + mother_edu + mother_occup_stat +
## mother_occup + mother_manag + tenancy + children + adults_working +
## both_parents_present
##
## Fitted party:
## [1] root
## | [2] father_cit in Austria
## | | [3] father_occup_stat in -5, 1, 2, 4
## | | | [4] mother_cit in Austria: 10.796 (n = 3712, err = 1541.8)
## | | | [5] mother_cit in Other
## | | | | [6] mother_occup_stat in -5, 1, 4: 10.705 (n = 42, err = 14.2)
## | | | | [7] mother_occup_stat in -2, 2, 3: 9.445 (n = 9, err = 101.3)
## | | [8] father_occup_stat in -2, 3
## | | | [9] mother_occup_stat in -5, -2, 2: 8.732 (n = 7, err = 91.5)
## | | | [10] mother_occup_stat in 1, 3, 4: 10.497 (n = 43, err = 11.3)
## | [11] father_cit in Other
## | | [12] children <= -3: 9.630 (n = 15, err = 107.2)
## | | [13] children > -3: 10.514 (n = 946, err = 630.3)
##
## Number of inner nodes: 6
## Number of terminal nodes: 7
mean((data19$inc_log - predict(caret_ctree_U))^2) #MSE
## [1] 0.5231771
cor(predict(caret_ctree_U, newdata=test),test$inc_log)^2 #R-sq
## [1] 0.07225173
plot(caret_ctree_U,gp = gpar(fontsize = 6),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Austria 2019 - Cross Validated")
We plot the conditional inference tree using both the caret and the party package to compare the results. In the first we used the suggested mincriterion = 0.99 obtained through cross-validation. Here, we showcase a tree with 6 inner and 7 terminal nodes. Again, the first split is made based on the citizenship of the respondents' fathers. We do not showcase the simplified tree, but instead show the distribution of the outcome variable through the box-plots at the terminal nodes. Again, we obtain widely different results for different groups in the population. The MSE for this model is 0.52 and the \(R^2\) has increased to 0.07, which is still very low. As an example of a predicted outcome, consider terminal node 13, where 946 observations are grouped together. Here, the respondents' fathers do not have Austrian citizenship, and the respondents are likely to come from larger families. However, all possible answers for family size or number of children are grouped together in this node, which is likely why the observations are so widely dispersed here. The predicted income for this node is EUR 36,643.82. All splits are chosen at a significance level of p < 0.001, which is in line with the disciplinary convention for hypothesis tests (Brunori, Hufe, and Mahler 2018, 9). We obtain another large group in terminal node 4, where 3712 observations, or 77% of all observations, are grouped together. The citizenship of the mother is the determining split variable here; the number of observations it accounts for already suggests that mother_cit could be a very important variable for determining income.
Following the description by Brunori, Hufe, and Mahler (2018), we change the testtype criterion to “Bonferroni”, as it lets us avoid the problems of tree-pruning. The Bonferroni correction also helps with the bias-variance trade-off, since it is a test criterion that is applied before each additional split.
caret_ctree_B <- ctree(formula, data = data19, control = ctree_control(testtype = "Bonferroni", alpha = 0.05))
caret_ctree_B
##
## Model formula:
## inc_log ~ sex + country_birth + father_cit + father_edu + father_occup_stat +
## father_occup + father_manag + mother_cit + mother_edu + mother_occup_stat +
## mother_occup + mother_manag + tenancy + children + adults_working +
## both_parents_present
##
## Fitted party:
## [1] root
## | [2] father_cit in Austria
## | | [3] father_occup_stat in -5, 1, 2, 4
## | | | [4] mother_cit in Austria: 10.796 (n = 3712, err = 1541.8)
## | | | [5] mother_cit in Other: 10.483 (n = 51, err = 127.3)
## | | [6] father_occup_stat in -2, 3
## | | | [7] mother_occup_stat in -5, -2, 2: 8.732 (n = 7, err = 91.5)
## | | | [8] mother_occup_stat in 1, 3, 4: 10.497 (n = 43, err = 11.3)
## | [9] father_cit in Other
## | | [10] children <= -3: 9.630 (n = 15, err = 107.2)
## | | [11] children > -3: 10.514 (n = 946, err = 630.3)
##
## Number of inner nodes: 5
## Number of terminal nodes: 6
mean((data19$inc_log - predict(caret_ctree_B))^2) #MSE
## [1] 0.5256449
cor(predict(caret_ctree_B, newdata=test),test$inc_log)^2 #R-sq
## [1] 0.05028226
plot(caret_ctree_B,gp = gpar(fontsize = 6),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Austria 2019 - Bonferroni")
We obtain a tree with 5 inner and 6 terminal nodes. This tree has an MSE of 0.52 and an \(R^2\) of 0.05, meaning it is also rather bad at predicting income based on circumstances, though better than the trees we estimated before. A large share of our observations is grouped into the two outer terminal nodes, which also suggests quite a lot of in-group variation; this is confirmed by the errors. This further suggests that the conditional inference trees perform rather poorly in predicting incomes. However, we have no model to compare our results to.
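One benchmark that is always available is a mean-only model: predicting the overall mean for everyone yields a test MSE equal to the variance of log income. A self-contained sketch with simulated stand-in values (illustrative numbers, not our SILC data):

```r
set.seed(123)
# stand-in for inc_log: mean and spread chosen for illustration only
y <- rnorm(1000, mean = 10.7, sd = 0.7)
baseline_mse <- mean((y - mean(y))^2)  # MSE of predicting the mean for everyone
baseline_mse
```

Any circumstance-based model is informative only to the extent that its test MSE undercuts such a mean-only baseline.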
The graph below shows how the p-value threshold is adjusted using the RMSE as an anchor. As the lowest RMSE is achieved using the strictest p-value, that is the one we chose.
plot(caret_ctree) # RMSE vs p-value our resampling parameter
plot(caret_rpart)
The procedure and application of conditional inference forests follows Brunori, Hufe, and Mahler (2018, 10). As discussed, the conditional inference trees construct as outcome the counterfactual distribution of the income variable. However, conditional inference trees use only limited information from the set of observed circumstances, since not all circumstances \(C^p \in \hat{\Omega}\) are utilized. Furthermore, the predictions (the values of the opportunity sets) have high variance. Conditional inference forests are able to deal with these shortcomings. The main difference between the forest and the tree approach is that in the forest each tree is estimated on a random subsample \(b\) of the original data, so that in total \(B\) trees are estimated. Furthermore, a random subset of circumstances is considered at each splitting point. This guarantees that at some point every circumstance with any informational value will be used as a splitting variable. Averaging the result over the \(B\) predictions reduces the variance. The individual predictions are a function of \(\alpha\), the significance level governing splits, \(\bar{P}\), the number of circumstances considered at each split, and \(\bar{B}\), the number of subsamples.
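The variance-reduction step can be made concrete with a minimal base-R simulation (a toy model of prediction noise, not the cforest internals): averaging \(B\) independent noisy predictions shrinks the prediction variance by roughly a factor of \(B\).

```r
set.seed(42)
B  <- 500                                   # number of "trees"
mu <- 10                                    # true value to be predicted
single   <- rnorm(1e4, mean = mu, sd = 1)   # one noisy prediction, variance ~ 1
averaged <- replicate(1e4, mean(rnorm(B, mean = mu, sd = 1)))  # variance ~ 1/B
var(single)    # close to 1
var(averaged)  # close to 1/500
```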
cf <- cforest(formula, train, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.5), ntree = 500L, perturb = list(replace = T, fraction = 0.8))
class(cf)
## [1] "cforest" "constparties" "parties"
data19$hat_cf <- predict(cf, newdata = as.data.frame(data19), OOB = TRUE, type = "response")
mean((data19$inc_log - predict(cf))^2) #MSE
## [1] 0.5669979
cor(predict(cf, newdata=test),test$inc_log)^2 #R-sq -> rather bad :(
## [1] 0.02519202
data19$pred_inc <- exp(data19$hat_cf)
data19$RMSE <- sqrt(sum((data19$income - data19$pred_inc)^2/nrow(data19), na.rm = T))
head(data19$RMSE)
## [1] 29501.47 29501.47 29501.47 29501.47 29501.47 29501.47
varimp(cf, mincriterion = 0, OOB = TRUE)
## sex country_birth father_cit
## 7.865194e-04 3.301931e-03 1.099378e-02
## father_edu father_occup_stat father_occup
## -9.441157e-04 3.851495e-03 1.268651e-03
## father_manag mother_cit mother_edu
## 8.410082e-04 8.355705e-03 9.498190e-04
## mother_occup_stat mother_occup mother_manag
## 5.504402e-04 1.264348e-03 -6.066113e-05
## tenancy children adults_working
## 2.310382e-03 8.203796e-04 1.127579e-03
## both_parents_present
## 1.068206e-03
importance_cf <- data.frame(varimp(cf, mincriterion = 0, OOB = TRUE))
names(importance_cf) <- "importance"
importance_cf$var_name = rownames(importance_cf)
importance_cf <- importance_cf %>% arrange( desc(importance))
For the conditional inference forest we obtain an MSE of 0.57 on the log-income scale (an RMSE of about 0.75) and an RMSE of roughly EUR 29,501 on the income scale. Furthermore, we obtain a table of variable importance as identified through the forest, which we arrange in descending order and plot below:
varimpo <- importance_cf %>% ggplot(aes(x = var_name, y = importance)) +
geom_pointrange(shape = 21, colour = "black", fill = "white", size = 3, stroke = 1, aes(ymin = 0, ymax = importance)) +
scale_x_discrete(limits = importance_cf$var_name[order(importance_cf$importance)]) +
labs(title = "Conditional Forest variable importance - Austria 2019", x = "", y = "Mean decrease in sum of squared residuals") +
coord_flip() +
theme_light() +
theme(axis.line = element_blank(), panel.border = element_blank(), panel.grid.major.y=element_blank())
ggplotly(varimpo)
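The importance measure plotted above is permutation-based: a variable's out-of-bag prediction error is recomputed after randomly permuting that variable, and the resulting increase in error is its importance. The idea can be sketched in base R with a toy linear model (an illustration of the principle, not the varimp internals):

```r
set.seed(7)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 2 * x1 + 0.1 * x2 + rnorm(n)          # x1 matters a lot, x2 barely
d   <- data.frame(y, x1, x2)
fit <- lm(y ~ x1 + x2, data = d)
mse <- function(dd) mean((dd$y - predict(fit, newdata = dd))^2)
base_mse <- mse(d)
# importance of v = increase in MSE after shuffling column v
perm_gain <- function(v) { d2 <- d; d2[[v]] <- sample(d2[[v]]); mse(d2) - base_mse }
perm_gain("x1")  # large increase
perm_gain("x2")  # near zero
```

Sampling noise in the permutation can push the estimate for an irrelevant variable slightly below zero, which is one plausible explanation for the small negative values in our plot.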
Both the conditional inference tree and the conditional inference forest analysis indicate that the citizenship of either parent is the most important determinant of an individual's income in Austria. This result is in line with the results of Brunori, Hufe, and Mahler (2018). We are puzzled by the fact that some variables seem to have a negative mean decrease in SSR. Overall, however, our results, while not building any statistical confidence in our estimation approach, confirm the results discussed and point to citizenship origin as a determining factor for opportunity in Austria.
In this part of the seminar paper, we attempt to reproduce the findings of Brunori, Hufe, and Mahler (2018), but unfortunately we do not have access to the actual EU-SILC data from 2011. Instead we reproduce the findings using the synthetic data provided by the European statistical office (Eurostat) (https://ec.europa.eu/eurostat/web/microdata/statistics-on-income-and-living-conditions).
The original data is not publicly provided, as the EU protects the privacy of the original respondents. The idea of the public microdata is that it allows us to train and write the code using the actual variable names, without obtaining true results. The EU-SILC public microdata files are fully synthetic: they were simulated using statistical modeling and reproduce the statistical distributions of the original data. The main caveat of this data is that it cannot be used for statistical inference about the wider population; the results and conclusions obtained from it are thus to be taken with a grain of salt. Luckily, the individual country data sets are grouped in a coherent manner. We use the EU-SILC data from 2011 because that survey contained, in addition to the usual questions, an ad-hoc module on inter-generational transmission, i.e. questions about the parents of the respondents. Unfortunately, however, the data contains various implausible errors, which make it difficult to apply the ctree method. While the synthetic data is distributed similarly to the actual data, its missing values are not systematic in any way but random. This makes cleaning the data basically impossible, since we would otherwise lose almost all of our observations. The data sets for Finland, Denmark, and Spain contain more missing values than answers for most of the ad-hoc questions. And the Italian household data set, as provided on the Eurostat website, only contains 138 observations, which makes merging it with the 20,000-observation personal data set useless. Furthermore, the variable for total disposable income contains many more negative values than the actual data, which leads to factually wrong outcomes for the Gini coefficient. It is puzzling: using the negative values we obtain plausible Gini coefficients for most of the countries, while these values make no sense for our further analysis.
The unique identifier used in all four data sets is the household ID: RX030 in the Personal Register, PX030 in the Personal Data, DB030 in the Household Register, and HB030 in the Household Data file. We only need to combine two of the data sets, namely the Household Data and the Personal Data; the latter contains the ad-hoc module with the questions on inter-generational characteristics.
Following Brunori, Hufe, and Mahler (2018), we use the following variables as circumstances: respondent's sex (PB150), respondent's country of birth (citizenship as proxy, PB220A), presence of parents at home (PT010), number of adults (18 or older) in the respondent's household (PT020), number of working adults (18 or older) in the respondent's household (PT030), father's/mother's country of birth and citizenship (PT060, PT070, PT090, PT100), father's/mother's education (PT110, PT120), father's/mother's occupational status (PT130, PT160), father's/mother's main occupation (PT150, PT180), managerial position of father/mother (PT140, PT170), and tenancy status of the house in which the respondent was living as a child (PT210).
Outcome Variable: Total Disposable Income (HY020), but we also experimented with other possible outcome variables.
We first select more variables than are ultimately used in the analysis. We use the year of birth to calculate age, and then exclude everyone aged 60 or above or below 30.
Initially, we ran the analysis with the citizenship variable included, but we ultimately decided that it is not really a circumstance variable, as the respondent's country of birth would have been, since it is ultimately possible to obtain a new citizenship.
# setting the data path
data_path ="./SILC_2011"
# accessing the data
AT_personal_data <- read.csv(file.path(data_path, "AT_2011p_EUSILC.csv"))
AT_household_data <- read.csv(file.path(data_path, "AT_2011h_EUSILC.csv"))
# change the name of the identifier variable
AT_household_data <- AT_household_data %>% rename("PX030" = HB030)
# joining the data
AT_equality_data <- AT_personal_data %>% left_join(AT_household_data, by = "PX030")
# For shorter chunks of code we group some of the following data wrangling steps together
selection_f <- c("PB140", "HY020", "PB150", "PB220A", "PT010", "PT020", "PT030", "PT060", "PT070", "PT090", "PT100", "PT110", "PT120", "PT130", "PT160", "PT150", "PT180", "PT140", "PT170", "PT210")
factor_f <- c("citizenship", "sex", "parents_present", "father_cob", "father_cit", "father_edu", "mother_edu", "mother_cob", "mother_cit", "mother_occup", "father_occup_stat", "mother_occup_stat", "father_occup", "mother_manag","father_manag", "tenancy")
# Renaming important variables for readability of tree
AT_equality_data <- AT_equality_data %>% select(all_of(selection_f)) %>% mutate(
age = (2011 - PB140)
) %>% filter(
age %in% (30:59), HY020 >= 0 # Since Statistik Austria excludes negative entries for total disposable income, we do the same here!
) %>%
rename(
"year_of_birth" = PB140, # Respondents year of birth
"annual_income" = HY020, # Total Disposable Income
"citizenship" = PB220A, # Respondents citizenship
"sex" = PB150, # Respondents sex
"parents_present" = PT010, # Presence of parents (or those considered as such)
"adults_home" = PT020, # Number of adults living in the household when the respondent was 14 years old
"children_home" = PT030, # Number of children in the household
"father_cob" = PT060, # Country of birth of father
"father_cit" = PT070, # Citizenship of father
"mother_cob" = PT090, # Country of birth of mother
"mother_cit" = PT100, # Citizenship of mother
"father_edu" = PT110, # Highest level of education attained by father
"mother_edu" = PT120, # Highest level of education attained by mother
"father_occup_stat" = PT130, # Activity status of father
"mother_occup_stat" = PT160, # Activity status of mother
"father_occup" = PT150, # Main occupation (job) of father
"mother_occup" = PT180, # Main occupation (job) of mother
"father_manag" = PT140, # Managerial position of father
"mother_manag" = PT170, # Managerial position of mother
"tenancy" = PT210) # Tenancy status when respondent was 14 years old
AT_equality_data[factor_f] <- lapply(AT_equality_data[factor_f], as.factor)
sapply(AT_equality_data, class)
## year_of_birth annual_income sex citizenship
## "integer" "integer" "factor" "factor"
## parents_present adults_home children_home father_cob
## "factor" "integer" "integer" "factor"
## father_cit mother_cob mother_cit father_edu
## "factor" "factor" "factor" "factor"
## mother_edu father_occup_stat mother_occup_stat father_occup
## "factor" "factor" "factor" "factor"
## mother_occup father_manag mother_manag tenancy
## "factor" "factor" "factor" "factor"
## age
## "numeric"
We again winsorize based on the 99.5th percentile of the income distribution. Since the negative income entries were already excluded above, the lower bound simply sets any remaining income below 1 equal to 1.
quantile_AT <- quantile(AT_equality_data$annual_income, weights = NULL, probs = seq(0, 1, 0.005), na.rm = FALSE, names = TRUE, type = 7) %>% tail(2)
AT_equality_data <- AT_equality_data %>% mutate(income = Winsorize(AT_equality_data$annual_income, minval = 1, maxval = quantile_AT[1], probs = c(0.05, 0.95), na.rm = FALSE, type = 7), inc_log = log(income))
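What this step does can be sketched in base R with toy values (hypothetical numbers, not the SILC incomes):

```r
x   <- c(-5, 0.2, 50, 200, 1e6)                # toy income vector
cap <- quantile(x, probs = 0.995, type = 7)    # upper winsorization bound
x_w <- pmin(pmax(x, 1), cap)                   # floor at 1, cap at the 99.5th percentile
x_w
```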
For simplicity we only name the factor levels for the Austrian circumstance variables:
levels(AT_equality_data$sex) <- c("Male", "Female")
levels(AT_equality_data$father_cit) <- c("Don't know","Resp. Present Country","Other EU country","Other European Country")
levels(AT_equality_data$mother_cit) <- c("Don't know","Resp. Present Country","Other EU country","Other European Country")
levels(AT_equality_data$father_cob) <- c("Don't know","Resp. Present Country","Other EU country","Other European Country")
levels(AT_equality_data$mother_cob) <- c("Don't know","Resp. Present Country","Other EU country","Other European Country")
levels(AT_equality_data$father_edu) <- c("Don't know", "None", "Low", "Medium", "High")
levels(AT_equality_data$mother_edu) <- c("Don't know", "None", "Low", "Medium", "High")
levels(AT_equality_data$father_manag) <- c("Don't know", "Supervisory", "Non-supervisory")
levels(AT_equality_data$mother_manag) <- c("Don't know", "Supervisory", "Non-supervisory")
levels(AT_equality_data$father_occup_stat) <- c("Don't know", "Employed", "Self-employed", "Unemployed", "Retired etc.", "Domestic", "Other inactive")
levels(AT_equality_data$mother_occup_stat) <- c("Don't know", "Employed", "Self-employed", "Unemployed", "Retired etc.", "Domestic", "Other inactive")
levels(AT_equality_data$tenancy) <- c("Don't know", "Owner", "Tenant", "Free Acc")
levels(AT_equality_data$father_occup) <- c("Don't know", "Armed forces", "Manager", "Professional", "Technician", "Clerical", "Service", "Agri", "Craft", "Operator", "Elementary")
levels(AT_equality_data$mother_occup) <- c("Don't know", "Armed forces", "Manager", "Professional", "Technician", "Clerical", "Service", "Agri", "Craft", "Operator", "Elementary")
Summary We provide the summary statistics for Austria, obtained using dfSummary from the package summarytools. Similar to the 2019 data set, AT_equality_data contains 5,850 observations and no missing entries in our outcome variable, annual income. However, it contains many missing values across the observed circumstances. We chose not to exclude those and instead deal with the missing entries using the na.action = na.pass argument in the statistical analysis.
print(dfSummary(AT_equality_data), method="render", style="grid", plain.ascii = F)
| No | Variable | Stats / Values | Freqs (% of Valid) | Valid | Missing |
|---|---|---|---|---|---|
| 1 | year_of_birth [integer] | Mean (sd): 1965.6 (8.3); min < med < max: 1952 < 1965 < 1981; IQR (CV): 13 (0) | 30 distinct values | 5850 (100.0%) | 0 (0.0%) |
| 2 | annual_income [integer] | Mean (sd): 59747.9 (47126.5); min < med < max: 18 < 49085 < 641380; IQR (CV): 52028.5 (0.8) | 3626 distinct values | 5850 (100.0%) | 0 (0.0%) |
| 3 | sex [factor] | 1. Male, 2. Female | | 5850 (100.0%) | 0 (0.0%) |
| 4 | citizenship [factor] | 1. AT, 2. EU, 3. Other | | 5850 (100.0%) | 0 (0.0%) |
| 5 | parents_present [factor] | 1. 1, 2. 2, 3. 3, 4. 4, 5. 5 | | 3684 (63.0%) | 2166 (37.0%) |
| 6 | adults_home [integer] | Mean (sd): 2.7 (1.3); min < med < max: 0 < 2 < 12; IQR (CV): 1 (0.5) | 13 distinct values | 3677 (62.9%) | 2173 (37.1%) |
| 7 | children_home [integer] | Mean (sd): 2.5 (1.6); min < med < max: 1 < 2 < 16; IQR (CV): 2 (0.6) | 14 distinct values | 3698 (63.2%) | 2152 (36.8%) |
| 8 | father_cob [factor] | 1. Don't know, 2. Resp. Present Country, 3. Other EU country, 4. Other European Country | | 3590 (61.4%) | 2260 (38.6%) |
| 9 | father_cit [factor] | 1. Don't know, 2. Resp. Present Country, 3. Other EU country, 4. Other European Country | | 3622 (61.9%) | 2228 (38.1%) |
| 10 | mother_cob [factor] | 1. Don't know, 2. Resp. Present Country, 3. Other EU country, 4. Other European Country | | 3676 (62.8%) | 2174 (37.2%) |
| 11 | mother_cit [factor] | 1. Don't know, 2. Resp. Present Country, 3. Other EU country, 4. Other European Country | | 3695 (63.2%) | 2155 (36.8%) |
| 12 | father_edu [factor] | 1. Don't know, 2. None, 3. Low, 4. Medium, 5. High | | 3700 (63.2%) | 2150 (36.8%) |
| 13 | mother_edu [factor] | 1. Don't know, 2. None, 3. Low, 4. Medium, 5. High | | 3660 (62.6%) | 2190 (37.4%) |
| 14 | father_occup_stat [factor] | 1. Don't know, 2. Employed, 3. Self-employed, 4. Unemployed, 5. Retired etc., 6. Domestic, 7. Other inactive | | 3479 (59.5%) | 2371 (40.5%) |
| 15 | mother_occup_stat [factor] | 1. Don't know, 2. Employed, 3. Self-employed, 4. Unemployed, 5. Retired etc., 6. Domestic, 7. Other inactive | | 3599 (61.5%) | 2251 (38.5%) |
| 16 | father_occup [factor] | 1. Don't know, 2. Armed forces, 3. Manager, 4. Professional, 5. Technician, 6. Clerical, 7. Service, 8. Agri, 9. Craft, 10. Operator, 11. Elementary | | 3379 (57.8%) | 2471 (42.2%) |
| 17 | mother_occup [factor] | 1. Don't know, 2. Armed forces, 3. Manager, 4. Professional, 5. Technician, 6. Clerical, 7. Service, 8. Agri, 9. Craft, 10. Operator, 11. Elementary | | 2005 (34.3%) | 3845 (65.7%) |
| 18 | father_manag [factor] | 1. Don't know, 2. Supervisory, 3. Non-supervisory | | 3426 (58.6%) | 2424 (41.4%) |
| 19 | mother_manag [factor] | 1. Don't know, 2. Supervisory, 3. Non-supervisory | | 1973 (33.7%) | 3877 (66.3%) |
| 20 | tenancy [factor] | 1. Don't know, 2. Owner, 3. Tenant, 4. Free Acc | | 3637 (62.2%) | 2213 (37.8%) |
| 21 | age [numeric] | Mean (sd): 45.4 (8.3); min < med < max: 30 < 46 < 59; IQR (CV): 13 (0.2) | 30 distinct values | 5850 (100.0%) | 0 (0.0%) |
| 22 | income [numeric] | Mean (sd): 59310.6 (44340.9); min < med < max: 18 < 49085 < 249601.6; IQR (CV): 52028.5 (0.7) | 3611 distinct values | 5850 (100.0%) | 0 (0.0%) |
| 23 | inc_log [numeric] | Mean (sd): 10.7 (1); min < med < max: 2.9 < 10.8 < 12.4; IQR (CV): 1.1 (0.1) | 3611 distinct values | 5850 (100.0%) | 0 (0.0%) |
Generated by summarytools 0.9.8 (R version 4.0.3)
2021-02-23
Summary Statistics All Countries
Before starting our empirical analysis using the synthetic EU-SILC data for six different countries, we first compare these countries to analyze how they differ. For each country we create a data frame containing the country's average equalized income, the standard deviation of annual income, the sample size, and the Gini coefficient. After obtaining the individual country data frames, we join them together in order to plot the differences in the Gini coefficients across the respective countries.
Looking at the summary statistics we find the highest average equalized income in Denmark (62,564.74) and the lowest in Latvia (13,544.87). As can also be seen from the plot, Denmark is the country with the lowest Gini coefficient (0.36), while the highest Gini occurs in Latvia (0.42). Given that the Nordic countries in particular are characterized by a high living standard, finding lower inequality in Denmark is not surprising. The Gini coefficient for Austria in the synthetic data set is 0.40, which is a lot higher than the Gini coefficient of 0.25 we calculated earlier from the real Statistics Austria data set. All Ginis are slightly higher than the World Bank estimates for 2017.
## Joining, by = c("Country", "Sample Size", "Avg. Equ.Income", "Std. dev.", "Gini")
| Country | Sample Size | Avg. Equ.Income | Std. dev. | Gini |
|---|---|---|---|---|
| AT | 5850 | 59747.89 | 47126.46 | 0.3986664 |
| FR | 9631 | 51978.37 | 48471.23 | 0.3908803 |
| DK | 3995 | 62564.74 | 44636.32 | 0.3602791 |
| ES | 15109 | 38950.95 | 31417.98 | 0.3982374 |
| FI | 7284 | 51691.63 | 37396.74 | 0.3676934 |
| LV | 6045 | 13544.87 | 11185.29 | 0.4246783 |
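The country-level computation code is not shown here; as an illustration, a Gini coefficient for a vector of non-negative incomes can be computed with a small base-R helper implementing the standard formula:

```r
# Gini coefficient: G = sum_i (2i - n - 1) * x_(i) / (n * sum(x)),
# with x sorted in ascending order.
gini <- function(x) {
  x <- sort(x[!is.na(x)])
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini(c(1, 1, 1, 1))  # perfect equality: 0
gini(c(0, 0, 0, 1))  # one person holds everything: 0.75
```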
Unfortunately the data is of very bad quality, which is why we chose to showcase only the results obtained from the conditional inference forest for the variable-importance comparison across countries. First, we split the data into training and test data, and define our formula for estimation.
set.seed(78910)
formula_2 <- inc_log ~ sex + parents_present + adults_home + children_home + father_cob + father_cit + mother_cob + mother_cit + father_edu + mother_edu + father_occup_stat + mother_occup_stat + father_occup + mother_occup + father_manag + mother_manag + tenancy
# Austria
AT_equality_data <- AT_equality_data %>%
mutate(train_index = sample(c("train", "test"), nrow(AT_equality_data), replace=TRUE, prob=c(0.67, 0.33)))
AT_train <- AT_equality_data %>% filter(train_index=="train")
AT_test <- AT_equality_data %>% filter(train_index=="test")
# France
FR_equality_data <- FR_equality_data %>%
mutate(train_index = sample(c("train", "test"), nrow(FR_equality_data), replace=TRUE, prob=c(0.67, 0.33)))
FR_train <- FR_equality_data %>% filter(train_index=="train")
FR_test <- FR_equality_data %>% filter(train_index=="test")
# Spain
ES_equality_data <- ES_equality_data %>%
mutate(train_index = sample(c("train", "test"), nrow(ES_equality_data), replace=TRUE, prob=c(0.67, 0.33)))
ES_train <- ES_equality_data %>% filter(train_index=="train")
ES_test <- ES_equality_data %>% filter(train_index=="test")
# Denmark
DK_equality_data <- DK_equality_data %>%
mutate(train_index = sample(c("train", "test"), nrow(DK_equality_data), replace=TRUE, prob=c(0.67, 0.33)))
DK_train <- DK_equality_data %>% filter(train_index=="train")
DK_test <- DK_equality_data %>% filter(train_index=="test")
# Finland
FI_equality_data <- FI_equality_data %>%
mutate(train_index = sample(c("train", "test"), nrow(FI_equality_data), replace=TRUE, prob=c(0.67, 0.33)))
FI_train <- FI_equality_data %>% filter(train_index=="train")
FI_test <- FI_equality_data %>% filter(train_index=="test")
# Latvia
LV_equality_data <- LV_equality_data %>%
mutate(train_index = sample(c("train", "test"), nrow(LV_equality_data), replace=TRUE, prob=c(0.67, 0.33)))
LV_train <- LV_equality_data %>% filter(train_index=="train")
LV_test <- LV_equality_data %>% filter(train_index=="test")
First, we show the estimation results for Austria in 2011 with the synthetic data, in order to compare them to the results obtained earlier.
AT_cf <- cforest(formula_2, AT_equality_data, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.9), ytrafo = NULL, scores = NULL, ntree = 500L, perturb = list(replace = T, fraction = 0.8))
AThat_cf <- predict(AT_cf, newdata = AT_test, OOB = TRUE, type = "response") # note: OOB only takes effect when newdata is NULL, so these are ordinary test-set predictions
mean((AT_equality_data$inc_log - predict(AT_cf))^2) # in-sample MSE
## [1] 0.9080596
cor(predict(AT_cf, newdata=NULL), AT_equality_data$inc_log)^2 # in-sample R-squared
## [1] 0.04091568
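Both statistics above are computed in-sample. Since predictions for the held-out Austrian test set are already stored in `AThat_cf`, the out-of-sample counterparts are immediate; a minimal sketch (the helper name `oos_fit` is ours, and the commented line assumes the objects created above are in scope):

```r
# Out-of-sample counterpart of the in-sample fit statistics above;
# y and y_hat are observed and predicted log incomes.
oos_fit <- function(y, y_hat) {
  c(MSE = mean((y - y_hat)^2, na.rm = TRUE),
    R2  = cor(y, y_hat, use = "complete.obs")^2)
}

# oos_fit(AT_test$inc_log, AThat_cf)
```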
importance_cf <- data.frame(varimp(AT_cf, mincriterion = 0, OOB = TRUE)) # Variable importance
names(importance_cf) <- "importance"
importance_cf$var_name = rownames(importance_cf)
importance_cf <- importance_cf %>%
arrange( desc(importance)) %>%
mutate(Country = "AT")
varimpo2 <- ggplot(importance_cf, aes(x = var_name, y = importance)) +
geom_pointrange(shape = 21, colour = "black", fill = "white", size = 3, stroke = 1, aes(ymin = 0, ymax = importance)) +
scale_x_discrete(limits = importance_cf$var_name[order(importance_cf$importance)]) +
labs(title = "Conditional Forest variable importance - Austria 2011", x = "", y = "Mean decrease in sum of squared residuals") +
coord_flip() +
theme_light() +
theme(axis.line = element_blank(), panel.border = element_blank(), panel.grid.major.y=element_blank())
ggplotly(varimpo2)
DESCRIBE RESULTS
Next we repeat the procedure for the remaining countries:
FR_cf <- cforest(formula_2, FR_equality_data, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.9), ytrafo = NULL, scores = NULL, ntree = 500L, perturb = list(replace = T, fraction = 0.8))
importance_cf_FR <- data.frame(varimp(FR_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf_FR) <- "importance"
importance_cf_FR$var_name = rownames(importance_cf_FR)
importance_cf_FR <- importance_cf_FR %>% arrange(desc(importance)) %>% mutate(Country = "FR")
FI_cf <- cforest(formula_2, FI_equality_data, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.9), ytrafo = NULL, scores = NULL, ntree = 200L, perturb = list(replace = T, fraction = 0.8))
#
importance_cf_FI <- data.frame(varimp(FI_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf_FI) <- "importance"
importance_cf_FI$var_name = rownames(importance_cf_FI)
importance_cf_FI <- importance_cf_FI %>% arrange(desc(importance)) %>% mutate(Country = "FI")
DK_cf <- cforest(formula_2, DK_equality_data, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.9), ytrafo = NULL, scores = NULL, ntree = 200L, perturb = list(replace = T, fraction = 0.8))
#
importance_cf_DK <- data.frame(varimp(DK_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf_DK) <- "importance"
importance_cf_DK$var_name = rownames(importance_cf_DK)
importance_cf_DK <- importance_cf_DK %>% arrange(desc(importance)) %>% mutate(Country = "DK")
LV_cf <- cforest(formula_2, LV_equality_data, na.action = na.pass, control = ctree_control(teststat = "quadratic", testtype = "Bonferroni", mincriterion = 0.9), ytrafo = NULL, scores = NULL, ntree = 500L, perturb = list(replace = T, fraction = 0.8))
importance_cf_LV <- data.frame(varimp(LV_cf, mincriterion = 0, OOB = TRUE))
names(importance_cf_LV) <- "importance"
importance_cf_LV$var_name = rownames(importance_cf_LV)
importance_cf_LV <- importance_cf_LV %>% arrange(desc(importance)) %>% mutate(Country = "LV")
Variable Importance Across Countries
df <- full_join(importance_cf, importance_cf_FR)
## Joining, by = c("importance", "var_name", "Country")
df <- full_join(df, importance_cf_FI)
## Joining, by = c("importance", "var_name", "Country")
df <- full_join(df, importance_cf_DK)
## Joining, by = c("importance", "var_name", "Country")
df <- full_join(df, importance_cf_LV) %>% group_by(Country)
## Joining, by = c("importance", "var_name", "Country")
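The chain of `full_join()` calls works here only because every importance table has identical columns, so each join degenerates into stacking rows; `dplyr::bind_rows()` states that intent directly. A small self-contained illustration:

```r
library(dplyr)

# Two toy importance tables with identical columns, as in the text
a <- data.frame(importance = 0.4, var_name = "father_edu", Country = "AT")
b <- data.frame(importance = 0.3, var_name = "mother_cob", Country = "FR")

bind_rows(a, b)  # one stacked data frame with the same three columns
```

Applied to the objects above, this would be `df <- bind_rows(importance_cf, importance_cf_FR, importance_cf_FI, importance_cf_DK, importance_cf_LV) %>% group_by(Country)`.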
dfvarimp <- ggplot(df, aes(x = var_name , y = importance, shape = Country, color=Country)) +
geom_point() +
scale_x_discrete(limits = importance_cf$var_name[order(importance_cf$importance)]) +
labs(title = "Conditional Forest variable importance - Country Comparison", x = "", y = "Variable importance") +theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
ggplotly(dfvarimp)
DESCRIBE PLOT
In this paper we have shown and described in detail the process of estimating and predicting inequality of opportunity using conditional inference trees and forests. Our focus was to point out the difficulties and restrictions encountered when working with microdata. Our outcome variable in the 2019 dataset did not reflect actual values, and the 2011 dataset contained only synthetic data. Thus, we were not able to precisely replicate the work of Brunori, Hufe, and Mahler (2018). However, we have shown that it is possible to use ML techniques to estimate inequality of opportunity. The clear benefit of conditional inference trees is that they give us a glimpse into what the structure of inequality of opportunity might look like in a country. Our results do not give us confidence in the predictive performance of either ML application employed. Thus, we ultimately cannot confirm the view, expressed by Brunori, Hufe, and Mahler (2018), that either application is a useful tool for presenting results on inequality of opportunity.
# References
AT_Ctree <- ctree(formula_2, data = AT_train, alpha = 0.05, maxdepth = 5)
AT_Ctree
##
## Model formula:
## inc_log ~ sex + parents_present + adults_home + children_home +
## father_cob + father_cit + mother_cob + mother_cit + father_edu +
## mother_edu + father_occup_stat + mother_occup_stat + father_occup +
## mother_occup + father_manag + mother_manag + tenancy
##
## Fitted party:
## [1] root
## | [2] mother_cob in Resp. Present Country, Other EU country
## | | [3] father_edu in Don't know, None, Low
## | | | [4] father_edu in Don't know, None
## | | | | [5] father_edu in Don't know: 10.585 (n = 112, err = 83.2)
## | | | | [6] father_edu in None: 10.055 (n = 20, err = 37.4)
## | | | [7] father_edu in Low
## | | | | [8] father_cob in Don't know, Resp. Present Country, Other EU country: 10.672 (n = 1070, err = 918.9)
## | | | | [9] father_cob in Other European Country: 10.510 (n = 149, err = 141.5)
## | | [10] father_edu in Medium, High
## | | | [11] father_cit in Don't know, Resp. Present Country, Other EU country
## | | | | [12] mother_edu in Don't know, Low
## | | | | | [13] mother_edu in Don't know: 10.110 (n = 17, err = 54.2)
## | | | | | [14] mother_edu in Low: 10.731 (n = 1009, err = 808.1)
## | | | | [15] mother_edu in None, Medium, High: 10.799 (n = 778, err = 619.2)
## | | | [16] father_cit in Other European Country: 10.552 (n = 240, err = 330.1)
## | [17] mother_cob in Other European Country: 10.487 (n = 517, err = 567.2)
##
## Number of inner nodes: 8
## Number of terminal nodes: 9
plot(AT_Ctree, type = "simple", gp = gpar(fontsize = 6),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Austria 2011")
AT_ctree2 <- ctree(formula_2, data = AT_train, control = ctree_control(testtype = "Bonferroni", mincriterion = 0.99, maxdepth = 5))
AT_ctree2
##
## Model formula:
## inc_log ~ sex + parents_present + adults_home + children_home +
## father_cob + father_cit + mother_cob + mother_cit + father_edu +
## mother_edu + father_occup_stat + mother_occup_stat + father_occup +
## mother_occup + father_manag + mother_manag + tenancy
##
## Fitted party:
## [1] root
## | [2] mother_cob in Resp. Present Country, Other EU country
## | | [3] father_cob in Don't know, Other European Country: 10.531 (n = 403, err = 343.7)
## | | [4] father_cob in Resp. Present Country, Other EU country: 10.711 (n = 2992, err = 2687.9)
## | [5] mother_cob in Other European Country: 10.513 (n = 517, err = 557.5)
##
## Number of inner nodes: 2
## Number of terminal nodes: 3
plot(AT_ctree2, type = "simple",gp = gpar(fontsize = 6),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for Austria 2011 - Cross Validated with Ctree")
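The terminal nodes above give each type's predicted mean outcome, from which an ex-ante IOP estimate follows by applying an inequality index to the smoothed distribution of predictions; Ferreira and Gignoux (2011) use the mean log deviation (MLD). A minimal sketch (the function name `mld` is ours; the commented line assumes the objects above, and note that `inc_log` must be exponentiated back to income levels first, glossing over the usual retransformation adjustment):

```r
# Mean log deviation of a vector of positive income levels:
# MLD = mean(log(mu / x)); it equals zero under perfect equality.
mld <- function(x) mean(log(mean(x) / x))

# Ex-ante IOP: MLD over each individual's predicted type mean, e.g.
# iop_AT <- mld(exp(predict(AT_ctree2, newdata = AT_train)))
```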
FR_ct <- ctree(formula_2, data = FR_train, control = ctree_control(testtype = "Bonferroni", mincriterion = 0.99, maxdepth = 5)) # Using the suggestion above, we generate a conditional inference tree and plot it as our final result
plot(FR_ct, type = "simple",gp = gpar(fontsize = 8),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE,id = FALSE), main = "Opportunity Conditional Inference Tree for France 2011 - Cross Validated")
# we do the control step using the default ctree_control function
LV_ct <- ctree(formula_2, data = LV_train, control = ctree_control(testtype = "Bonferroni", mincriterion = 0.99))
plot(LV_ct, type = "simple", gp = gpar(fontsize = 8),
inner_panel=node_inner,
ip_args=list(abbreviate = FALSE,id = FALSE), main = "Conditional Inference Tree for Latvia 2011 - Cross Validated")
Finally, we report how long it took to knit the document:
end <- Sys.time()
end-start
## Time difference of 8.162986 mins
Bourguignon, François, Francisco Ferreira, and Marta Menéndez. 2007. “INEQUALITY of Opportunity in Brazil.” Review of Income and Wealth 53 (4): 585–618. https://EconPapers.repec.org/RePEc:bla:revinw:v:53:y:2007:i:4:p:585-618.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. Belmont, CA: Wadsworth International Group.
Brunori, Paolo, Paul Hufe, and Gerszon Daniel Mahler. 2018. “The Roots of Inequality: Estimating Inequality of Opportunity from Regression Trees.” Ifo Working Paper Series 252. ifo Institute - Leibniz Institute for Economic Research at the University of Munich. https://ideas.repec.org/p/ces/ifowps/_252.html.
Brunori, Paolo, and Guido Neidhoefer. 2020. “The Evolution of Inequality of Opportunity in Germany: A Machine Learning Approach.” SERIES 01-2020. Dipartimento di Economia e Finanza - Università degli Studi di Bari "Aldo Moro". http://www.seriesworkingpapers.it/RePEc/bai/series/SERIES_WP_01-2020.pdf.
Ferreira, Francisco, and Jérémie Gignoux. 2011. “THE Measurement of Inequality of Opportunity: THEORY and an Application to Latin America.” Review of Income and Wealth 57 (4): 622–57. https://EconPapers.repec.org/RePEc:bla:revinw:v:57:y:2011:i:4:p:622-657.
Roemer, John E. 1998. Equality of Opportunity. Cambridge, MA: Harvard University Press.
Roemer, John E., and Alain Trannoy. 2015. “Equality of Opportunity.” In Handbook of Income Distribution, 2:217–300. Elsevier.
Hothorn, Torsten, Achim Zeileis, and Kurt Hornik. 2006. “Ctree: Conditional Inference Trees.” R package vignette. https://cran.r-project.org/web/packages/partykit/vignettes/ctree.pdf.